[mlir][amdgpu] Introduce assume_subgroup_uniform op #152740


Closed
wants to merge 1 commit into from

Conversation

Hardcode84
Contributor

`assume_subgroup_uniform` works as a compiler hint that forces a specific value into a scalar register.
It is currently implemented via the `readfirstlane` intrinsic.
Unlike a direct `readfirstlane` call, this op is potentially speculatable and has the usual arith and int-range interfaces.
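
A minimal usage sketch, based on the tests added in this patch (the function name is illustrative):

```mlir
func.func @scalarize_offset(%idx : index) -> index {
  // Hint that %idx is uniform across the active lanes of the subgroup;
  // the AMDGPU-to-ROCDL conversion in this patch lowers it to rocdl.readfirstlane.
  %u = amdgpu.assume_subgroup_uniform %idx : index
  func.return %u : index
}
```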
@llvmbot
Member

llvmbot commented Aug 8, 2025

@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-mlir-gpu

Author: Ivan Butygin (Hardcode84)

Changes

`assume_subgroup_uniform` works as a compiler hint that forces a specific value into a scalar register.
It is currently implemented via the `readfirstlane` intrinsic.
Unlike a direct `readfirstlane` call, this op is potentially speculatable and has the usual arith and int-range interfaces.


Full diff: https://github.com/llvm/llvm-project/pull/152740.diff

9 Files Affected:

  • (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td (+35-3)
  • (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h (+1)
  • (modified) mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp (+16-1)
  • (modified) mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp (+16)
  • (modified) mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir (+14)
  • (modified) mlir/test/Dialect/AMDGPU/canonicalize.mlir (+10)
  • (modified) mlir/test/Dialect/AMDGPU/ops.mlir (+10)
  • (added) mlir/test/Dialect/AMDGPU/subgroup-uniform-int-range.mlir (+13)
  • (added) mlir/test/Dialect/AMDGPU/subgroup-uniform-speculability.mlir (+21)
diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
index 2c646934c11c2..b0b94ed49f2e5 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
@@ -9,12 +9,13 @@
 #ifndef AMDGPU
 #define AMDGPU
 
+include "mlir/IR/EnumAttr.td"
+include "mlir/IR/OpBase.td"
+include "mlir/IR/Properties.td"
+include "mlir/Interfaces/InferIntRangeInterface.td"
 include "mlir/Interfaces/InferTypeOpInterface.td"
 include "mlir/Interfaces/SideEffectInterfaces.td"
 include "mlir/Interfaces/ViewLikeInterface.td"
-include "mlir/IR/EnumAttr.td"
-include "mlir/IR/Properties.td"
-include "mlir/IR/OpBase.td"
 
 def AMDGPU_Dialect : Dialect {
   let name = "amdgpu";
@@ -635,6 +636,37 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
   let hasVerifier = 1;
 }
 
+def AMDGPU_AssumeSubgroupUniformOp : AMDGPU_Op<"assume_subgroup_uniform",
+    [NoMemoryEffect, AllTypesMatch<["result", "src"]>,
+    DeclareOpInterfaceMethods<InferIntRangeInterface, ["inferResultRanges"]>,
+    DeclareOpInterfaceMethods<ConditionallySpeculatable, ["getSpeculatability"]>] #
+    ElementwiseMappable.traits>,
+  Arguments<(ins AnyType:$src,
+                 DefaultValuedAttr<UnitAttr, "false">:$all_lanes)> {
+  let summary = "Assumes value is unform across the lanes in subgroup";
+  let description = [{
+      This op is a compiler hint to help backend put values into scalar registers.
+
+      If `src` value is uniform across all the active subgroup lanes it is
+      returned unchanged, otherwise result is poison.
+
+      If `all_lanes` is set, the value is assumed to be uniform across all the
+      subgroup lanes, this can allow to speculate it out of control flow, which
+      may change the current active lanes, i.e:
+      ```
+      // %value must be uniform at this point
+      %value = ...
+      scf.if lane_id < 13 {
+        %uniform = amdgpu.assume_subgroup_uniform all_lanes %value
+      }
+      ```
+  }];
+  let results = (outs AnyType:$result);
+  let assemblyFormat = [{
+    (`all_lanes` $all_lanes^)? $src attr-dict `:` type($result)
+  }];
+}
+
 def AMDGPU_SwizzleBitModeOp : AMDGPU_Op<"swizzle_bitmode",
     [Pure, AllTypesMatch<["result", "src"]>]>,
   Arguments<(ins AnyIntegerOrFloatOr1DVector:$src,
diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h
index 3de57c923178a..196ce08b5954c 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h
@@ -18,6 +18,7 @@
 #include "mlir/IR/BuiltinTypes.h"
 #include "mlir/IR/Dialect.h"
 #include "mlir/IR/OpDefinition.h"
+#include "mlir/Interfaces/InferIntRangeInterface.h"
 #include "mlir/Interfaces/InferTypeOpInterface.h"
 #include "mlir/Interfaces/SideEffectInterfaces.h"
 #include "mlir/Interfaces/ViewLikeInterface.h"
diff --git a/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp b/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
index 64720bfe6cf50..3f52309005690 100644
--- a/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
+++ b/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
@@ -1876,6 +1876,19 @@ struct AMDGPUSwizzleBitModeLowering
   }
 };
 
+struct AMDGPUAssumeSubgroupUniformLowering
+    : public ConvertOpToLLVMPattern<AssumeSubgroupUniformOp> {
+  using ConvertOpToLLVMPattern::ConvertOpToLLVMPattern;
+
+  LogicalResult
+  matchAndRewrite(AssumeSubgroupUniformOp op, OpAdaptor adaptor,
+                  ConversionPatternRewriter &rewriter) const override {
+    Value src = adaptor.getSrc();
+    rewriter.replaceOpWithNewOp<ROCDL::ReadfirstlaneOp>(op, src.getType(), src);
+    return success();
+  }
+};
+
 struct ConvertAMDGPUToROCDLPass
     : public impl::ConvertAMDGPUToROCDLPassBase<ConvertAMDGPUToROCDLPass> {
   using Base::Base;
@@ -1945,5 +1958,7 @@ void mlir::populateAMDGPUToROCDLConversionPatterns(LLVMTypeConverter &converter,
            PackedScaledTruncOpLowering, PackedTrunc2xFp8OpLowering,
            PackedStochRoundFp8OpLowering, GatherToLDSOpLowering,
            TransposeLoadOpLowering>(converter, chipset);
-  patterns.add<AMDGPUSwizzleBitModeLowering>(converter);
+  patterns
+      .add<AMDGPUSwizzleBitModeLowering, AMDGPUAssumeSubgroupUniformLowering>(
+          converter);
 }
diff --git a/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp b/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp
index d7ffdcb58ddb5..0115a85ba0bfe 100644
--- a/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp
+++ b/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp
@@ -510,6 +510,22 @@ LogicalResult DPPOp::verify() {
   return success();
 }
 
+//===----------------------------------------------------------------------===//
+// AssumeSubgroupUniformOp
+//===----------------------------------------------------------------------===//
+
+void AssumeSubgroupUniformOp::inferResultRanges(
+    ArrayRef<ConstantIntRanges> argRanges, SetIntRangeFn setResultRange) {
+  setResultRange(getResult(), argRanges.front());
+}
+
+Speculation::Speculatability AssumeSubgroupUniformOp::getSpeculatability() {
+  if (getAllLanes())
+    return Speculation::Speculatable;
+
+  return Speculation::NotSpeculatable;
+}
+
 //===----------------------------------------------------------------------===//
 // GatherToLDSOp
 //===----------------------------------------------------------------------===//
diff --git a/mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir b/mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir
index cc1162d8b0de8..6eaf68f84e38f 100644
--- a/mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir
+++ b/mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir
@@ -461,3 +461,17 @@ func.func @sched_barrier() {
   amdgpu.sched_barrier allow = <valu|all_vmem>
   func.return
 }
+
+// CHECK-LABEL: func @assume_subgroup_uniform
+//  CHECK-SAME:   (%[[ARG:.*]]: index)
+func.func @assume_subgroup_uniform(%arg0 : index) -> (index, index) {
+//       CHECK:   %[[SRC:.*]] = builtin.unrealized_conversion_cast %[[ARG]] : index to i64
+//       CHECK:   %[[V1:.*]] = rocdl.readfirstlane %[[SRC]] : i64
+//       CHECK:   %[[RES1:.*]] = builtin.unrealized_conversion_cast %[[V1]] : i64 to index
+//       CHECK:   %[[V2:.*]] = rocdl.readfirstlane %[[SRC]] : i64
+//       CHECK:   %[[RES2:.*]] = builtin.unrealized_conversion_cast %[[V2]] : i64 to index
+//       CHECK:   return %[[RES1]], %[[RES2]] : index, index
+  %0 = amdgpu.assume_subgroup_uniform %arg0 : index
+  %1 = amdgpu.assume_subgroup_uniform all_lanes %arg0 : index
+  func.return %0, %1 : index, index
+}
diff --git a/mlir/test/Dialect/AMDGPU/canonicalize.mlir b/mlir/test/Dialect/AMDGPU/canonicalize.mlir
index 5501ad42dbd90..141bd3f459738 100644
--- a/mlir/test/Dialect/AMDGPU/canonicalize.mlir
+++ b/mlir/test/Dialect/AMDGPU/canonicalize.mlir
@@ -159,3 +159,13 @@ func.func @fold_gather_to_lds_of_cast_dest(%global: memref<128x72xf32, 1>, %lds:
     : f32, memref<128x72xf32, 1>, memref<?x?xf32, 3>
   func.return
 }
+
+// -----
+
+// CHECK-LABEL: func @assume_subgroup_uniform_unused
+func.func @assume_subgroup_uniform_unused(%arg0 : f32) {
+// CHECK-NOT: amdgpu.assume_subgroup_uniform
+  %0 = amdgpu.assume_subgroup_uniform %arg0 : f32
+  %1 = amdgpu.assume_subgroup_uniform all_lanes %arg0 : f32
+  func.return
+}
diff --git a/mlir/test/Dialect/AMDGPU/ops.mlir b/mlir/test/Dialect/AMDGPU/ops.mlir
index 87e11c028c62a..97b4d5f54506f 100644
--- a/mlir/test/Dialect/AMDGPU/ops.mlir
+++ b/mlir/test/Dialect/AMDGPU/ops.mlir
@@ -517,6 +517,16 @@ func.func @wmma(%arg0 : vector<16xf16>, %arg1 : vector<8xf16>) -> vector<8xf16>
   func.return %0 : vector<8xf16>
 }
 
+// CHECK-LABEL: func @assume_subgroup_uniform
+//  CHECK-SAME: (%[[ARG:.*]]: f32)
+func.func @assume_subgroup_uniform(%arg0 : f32) -> (f32, f32) {
+  // CHECK: amdgpu.assume_subgroup_uniform %[[ARG]] : f32
+  %0 = amdgpu.assume_subgroup_uniform %arg0 : f32
+  // CHECK: amdgpu.assume_subgroup_uniform all_lanes %[[ARG]] : f32
+  %1 = amdgpu.assume_subgroup_uniform all_lanes %arg0 : f32
+  func.return %0, %1 : f32, f32
+}
+
 // CHECK-LABEL: func @swizzle_bitmode
 func.func @swizzle_bitmode(%arg0 : f32) -> f32 {
   // CHECK: amdgpu.swizzle_bitmode
diff --git a/mlir/test/Dialect/AMDGPU/subgroup-uniform-int-range.mlir b/mlir/test/Dialect/AMDGPU/subgroup-uniform-int-range.mlir
new file mode 100644
index 0000000000000..be20bfdba3baf
--- /dev/null
+++ b/mlir/test/Dialect/AMDGPU/subgroup-uniform-int-range.mlir
@@ -0,0 +1,13 @@
+// RUN: mlir-opt --arith-int-range-narrowing="int-bitwidths-supported=32" --split-input-file %s | FileCheck %s
+
+// CHECK-LABEL: func @narrow
+//       CHECK:   %[[SRC:.*]] = test.with_bounds {smax = 10 : index, smin = 0 : index, umax = 10 : index, umin = 0 : index} : index
+//       CHECK:   %[[CAST1:.*]] = arith.index_castui %[[SRC]] : index to i32
+//       CHECK:   %[[VAL:.*]] = amdgpu.assume_subgroup_uniform %[[CAST1]] : i32
+//       CHECK:   %[[CAST2:.*]] = arith.index_castui %[[VAL]] : i32 to index
+//       CHECK:   return %[[CAST2]] : index
+func.func @narrow() -> index {
+  %0 = test.with_bounds { umin = 0 : index, umax = 10 : index, smin = 0 : index, smax = 10 : index } : index
+  %1 = amdgpu.assume_subgroup_uniform %0 : index
+  return %1: index
+}
diff --git a/mlir/test/Dialect/AMDGPU/subgroup-uniform-speculability.mlir b/mlir/test/Dialect/AMDGPU/subgroup-uniform-speculability.mlir
new file mode 100644
index 0000000000000..9be2b5dda267e
--- /dev/null
+++ b/mlir/test/Dialect/AMDGPU/subgroup-uniform-speculability.mlir
@@ -0,0 +1,21 @@
+// RUN: mlir-opt %s --loop-invariant-code-motion | FileCheck %s
+
+func.func private @side_effect(%arg0 : f32, %arg1 : f32)
+
+// CHECK-LABEL: func @assume_subgroup_uniform_hoisting
+//  CHECK-SAME: (%[[ARG:.*]]: f32)
+func.func @assume_subgroup_uniform_hoisting(%arg0 : f32) {
+  %c0 = arith.constant 0 : index
+  %c1 = arith.constant 1 : index
+  %c10 = arith.constant 10 : index
+// CHECK: %[[V1:.*]] = amdgpu.assume_subgroup_uniform all_lanes %[[ARG]] : f32
+// CHECK: scf.for
+// CHECK: %[[V0:.*]] = amdgpu.assume_subgroup_uniform %[[ARG]] : f32
+// CHECK: func.call @side_effect(%[[V0]], %[[V1]])
+  scf.for %i = %c0 to %c10 step %c1 {
+    %0 = amdgpu.assume_subgroup_uniform %arg0 : f32
+    %1 = amdgpu.assume_subgroup_uniform all_lanes %arg0 : f32
+    func.call @side_effect(%0, %1) : (f32, f32) -> ()
+  }
+  func.return
+}

@llvmbot
Member

llvmbot commented Aug 8, 2025

@llvm/pr-subscribers-mlir-amdgpu

@jhuber6 jhuber6 requested a review from arsenm August 8, 2025 15:31
Member

@kuhar kuhar left a comment


I wonder if it would make sense to have this in the gpu dialect? I think it should be portable and generalize to other architectures.

If `all_lanes` is set, the value is assumed to be uniform across all the
subgroup lanes, this can allow to speculate it out of control flow, which
may change the current active lanes, i.e:
```
Member


Suggested change: use a ```mlir fence instead of a bare ``` fence.

Comment on lines +653 to +654
If `all_lanes` is set, the value is assumed to be uniform across all the
subgroup lanes, this can allow to speculate it out of control flow, which
Member


I don't understand why this isn't the only allowed scenario? If it does not have to be uniform, I think we should call it get_first_lane

Contributor

@arsenm arsenm left a comment


This is something that should be handled in LLVM IR, not MLIR. It is also path-dependent, so it should look more like an assume call than like something that propagates a value. If you're just going to insert readfirstlane, you can insert readfirstlane directly (i.e. I don't see what a wrapper around it buys you).

@krzysz00
Contributor

krzysz00 commented Aug 8, 2025

  1. Per discussion offline, this should be a gpu.* op that lets us also lower to SPIR-V, Nvidia, etc.
  2. @arsenm The main motivation behind this op was to allow readfirstlane to be more speculatable than it usually is by letting you stick a "trust me" on it ... though we're also realizing that there's an abstraction we want around readfirstlane, the equivalent SPIR-V op, Nvidia's equivalent, etc.

Is there a way to make readfirstlane more speculatable?

To give higher-level context: the "I'm not reading all that" version is over in https://gist.github.com/Hardcode84/7d8c81d4081d8bb99a3bf74054ff7768, which has some IREE MLIR and some of the generated LLVM IR and assembly.

Even after applying #152581 to make the LDS subgroup indices recognizable as uniform, the readfirstlane for the move to m0 gets inserted inside the loop where the direct-to-LDS loads are, not outside the loop where the pointer is computed.

That is, we have

l0 = f(tidx >> 6, tidy) ; uniform, vgpr
l1 = l0 + C1
l2 = l0 + C2
...
for (k from 0 upto K) {
  load_lds(ptr addrspace(1/7) g(K, tidx, tidy, 0), ptr addrspace(3) l0, ...)
  load_lds(g(K, tidx, tidy, 1), l1, ...)
  ... [barriers, compute stuff, etc. this loop might be unrolled]
}

and because the l_i are computed in VGPRs, they only get readfirstlane'd automatically right at the point where that transition is needed ... and then it doesn't get hoisted, either because it's too late for that, because that's not a trivial rewrite, or because no one's added logic for it.

So adding an MLIR-level abstraction around readfirstlane will not only allow us to make it platform-independent, but it'll also allow us (by way of letting us stick an assertion of speculatability on it) to LICM it.
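
A hypothetical sketch (not part of the patch) of how that would look at the MLIR level, modeled on the speculability test above; the helper name and types are illustrative:

```mlir
func.func private @load_lds(%arg0 : index)

func.func @k_loop(%l0 : index, %k_upper : index) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  scf.for %k = %c0 to %k_upper step %c1 {
    // all_lanes asserts uniformity across the whole subgroup, not just the
    // currently active lanes, so --loop-invariant-code-motion is allowed to
    // hoist this hint (and the eventual readfirstlane) out of the k-loop.
    %l0_u = amdgpu.assume_subgroup_uniform all_lanes %l0 : index
    func.call @load_lds(%l0_u) : (index) -> ()
  }
  func.return
}
```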

@Hardcode84
Contributor Author

Closing in favor of #152808

@arsenm
Contributor

arsenm commented Aug 12, 2025

Is there a way to make readfirstlane more speculatable?

Either a readanylane intrinsic, or possibly, in the future, we could do something with convergence tokens.

@krzysz00
Contributor

Is a readanylane intrinsic something that makes sense?

@arsenm
Contributor

arsenm commented Aug 13, 2025

Is a readanylane intrinsic something that makes sense?

We have a readanylane pseudo in the new regbankselect path, so maybe?
